Abstract

Sentence categorization according to
language is a fundamental issue for NLP,
especially in document processing. With
the growing amount of multilingual text
corpus data becoming available, sentence
categorization, which reveals the
multilingual structure of a text, opens a
wide range of applications in multilingual
text analysis, such as information retrieval
or preprocessing for a multilingual
syntactic parser.

The major difficulties in sentence
categorization are convergence and textual
errors. Convergence, because dealing with
short entries involves discarding languages
on the basis of few clues. Textual errors,
because documents coming from different
electronic sources may contain spelling and
grammatical errors as well as character
recognition errors generated by OCR.

We describe here an approach to sentence
categorization whose originality is to be
based on natural properties of languages,
with no training set dependency. The
implementation is fast, small, robust and
tolerant of textual errors. Tested for
French, English, Spanish and German
discrimination, the system gives very
interesting results, achieving in one test
99.4% correct assignments on real sentences.

The resolution power is based on
grammatical words (not the most common
words) and alphabet. Having the grammatical
words and the alphabet of each language at
its disposal, the system computes for each
language its likelihood of being selected.
The name of the language having the optimum
likelihood will tag the sentence, but
unresolved ambiguities will be maintained.
We will discuss the reasons which led us to
use these linguistic facts and present
several directions to improve the system's
classification performance.

Categorizing sentences with linguistic
properties shows that difficult problems
sometimes have simple solutions.



1 Categorization according to 
Language 

1.1 From Text Categorization . . . 

Text categorization according to language emerged
with the need to process texts coming from all over
the world. The goal of text categorization is to tag
texts with the name of the language in which they
are written. Information retrieval is the main
application field.

To do this job, the traditional way is to exploit
the differences between letter combinations in
different languages (Cavnar and Trenkle, 1994). For
each language, the system computes from a training
set a profile based on the frequency (or probability)
of letter sequences. Then, for a given text, it
computes a profile and selects the language which
has the closest profile.
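
As an illustration only, the profile-based approach can be sketched in
Python as follows. The function names, the n-gram range (1 to 3), the
profile size and the rank-difference distance are assumptions made for
this sketch, in the spirit of (Cavnar and Trenkle, 1994); they are not
taken from any particular system.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top=300):
    """Map the most frequent letter n-grams (n = 1..n_max) to their rank."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(
        padded[i:i + n]
        for n in range(1, n_max + 1)
        for i in range(len(padded) - n + 1)
    )
    ranked = [gram for gram, _ in counts.most_common(top)]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams unknown to the language get a fixed penalty."""
    penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else penalty
        for gram, rank in doc_profile.items()
    )


def closest_language(text, lang_profiles):
    """Return the language whose training profile is closest to the text profile."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))
```

The profiles in `lang_profiles` would be built once per language by
calling `ngram_profile` on a training text, which is exactly where the
dependency on the training set comes from.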

While some text categorization systems give very
good results, the major problem is that their quality
depends entirely on the training set. Profiles require
a lot of data to converge, and building a large
representative training set is a real problem.
Moreover, this method assumes that texts are
monolingual, and results will be affected when dealing
with multilingual texts. It does not take natural
language properties into account: it only considers
texts as streams of characters. There is no linguistic
justification.

1.2 ... to Multilingual Sentence 
Categorization 

Today, the problem is quite different. Texts are more
and more multilingual (especially due to citations)
and we do not have enough tools to process them
efficiently. Tagging sentences with the name of their
language solves this problem by letting each
application switch according to the language. This
affects the whole of NLP. Information retrieval is not
the only field concerned: syntactic analysis and every
application based on it are concerned too. Studying
one particular language in multilingual texts without
parasitic noise also becomes possible.

Using the previous method is not possible because
the sentence is too small a unit for profiles to
converge. The analysis method must be more precise to
reveal each possible change of language.

We remark that a change of language in a text
can appear at each change of sentence (more often
of paragraph) or in each included segment introduced
by quotes, parentheses, dashes or colons. We will call
sentence the traditional sentence but also each
segment included in it.
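
To make this extended notion of sentence concrete, here is a minimal
Python sketch that separates a sentence from its included segments.
It is an assumption-laden illustration, not the sentence parser used in
this work: the name `split_segments` is hypothetical, and only double
quotes and parentheses are handled (dashes and colons are left out).

```python
import re

# Assumption: an included segment is delimited by double quotes or parentheses.
_INCLUDED = re.compile(r'"([^"]*)"|\(([^()]*)\)')


def split_segments(sentence):
    """Return the outer sentence first, then each included segment, recursively."""
    segments = []

    def strip_included(match):
        inner = match.group(1) if match.group(1) is not None else match.group(2)
        # An included segment may itself contain included segments.
        segments.extend(split_segments(inner))
        return " "

    outer = " ".join(_INCLUDED.sub(strip_included, sentence).split())
    return ([outer] if outer else []) + segments
```

Each returned segment can then be categorized on its own, so that a
quotation in another language does not disturb the categorization of
the sentence that contains it.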

2 Multilingual Sentence 
Categorization 

Studying large quantities of texts, we tried to
understand as well as possible the ways to discriminate
languages. We present in this section the results of
our research which have been implemented and, in the
next section, other directions which seem promising.

2.1 Grammatical Words as Discriminant 

In this section, we give the reasons which led us
to choose grammatical words as discriminants.

Grammatical words are proper to each language
and are on the whole different from one language to
another. Moreover, they are short, not numerous,
and we can easily build an exhaustive list. So, these
words can be used to discriminate languages. But
can we use them to discriminate sentences?

Grammatical words represent on average about 50%
of the words in a sentence. They cannot be omitted
because they structure sentences and make them
understandable. Furthermore, relying on grammatical
words makes the method tolerant of textual errors
and of words imported from other languages (usual in
scientific texts). It is also important to note that
imported foreign words are nouns, verbs or adjectives,
but never grammatical words.



These rules allow us to categorize sentences
which have enough grammatical words, but in short
sentences (less than 10 words) there are few
grammatical words and therefore few clues. We must
introduce additional knowledge to improve the
categorization of short sentences.

2.2 Using the Alphabet 

To improve the categorization of short sentences, a
simple way is to use the alphabet. Alphabets are
proper to each language and, even if they have a large
common part, some signs such as accents allow
discrimination between them. This is not the only way
to improve categorization, and we will see other
possibilities in section 3.
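
A minimal Python sketch of this idea follows. The alphabets below are
deliberately small and incomplete, and both the table `ALPHABETS` and
the function `alphabet_candidates` are hypothetical names introduced
for illustration; they are not the data or code of the system. The
sketch also assumes that accented letters are encoded as single
precomposed characters.

```python
import string

BASE = set(string.ascii_lowercase)

# Assumption: illustrative, incomplete alphabets for the four languages.
ALPHABETS = {
    "French": BASE | set("àâçéèêëîïôùûüœ"),
    "English": BASE,
    "Spanish": BASE | set("áéíóúüñ"),
    "German": BASE | set("äöüß"),
}


def alphabet_candidates(word):
    """Return the languages whose alphabet contains every letter of the word."""
    letters = {c for c in word.lower() if c.isalpha()}
    return {lang for lang, alphabet in ALPHABETS.items() if letters <= alphabet}
```

A word such as "señor" leaves a single candidate, while "café" stays
ambiguous between French and Spanish and a word without any accent
discards no language at all, which is why the alphabet alone is a weak
clue.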

2.3 Notes 

• It is interesting that, using this knowledge,
the system is coherent with multilingual
syntactic parsers which only rely on grammatical
words and endings. So, the categorization
system can constitute a switch for these parsers
(Vergne, 199$; Vergne, 1994).



• We can also remark that using grammatical
words is different from using the most common
words. In fact, the most common words require
a training set, and it is well known that a
representative training set is very difficult
to get. The number of words to retain is
quite subjective. Moreover, frequency is
relative to texts, not to sentences.

3 Improving Categorization 

Sentence categorization can be improved at two
levels: a level below, using word morphology, and
a level above, using text structure. These
improvements have not been implemented yet and will
be the object of further work.

3.1 Knowledge upon Words Morphology 

Two main ways can be explored to improve
categorization, using natural language properties:

• Syllabification: the idea is to check that words
are well formed syllabically in a language. It
requires distinguishing first, middle and last
syllables. (Using only endings seems to be a
possible way.)

• Sequences of vowels or consonants: the idea is
that these sequences are proper to each
language.



3.2 Using Text Structure 

When dealing with texts, we can also use heuristic
knowledge about text structure:

• In the same paragraph, contiguous sentences are
written in the same language.

• The title of a paragraph is written in the same
language as its body.

• Blocks included in a sentence (via parentheses,
...) are written in the same language as the
sentence.
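
The first heuristic can be illustrated with the following Python
sketch. It is hypothetical: these improvements are not implemented in
the system, and the function name and the representation of tags as
sets of candidate languages are assumptions made for the example.

```python
def resolve_with_context(paragraph_tags):
    """Resolve ambiguous sentences using the last decided sentence of the paragraph.

    paragraph_tags: list of sets of candidate languages, one set per sentence.
    An empty set stands for a sentence with no clue at all.
    """
    resolved = []
    previous = None  # language of the last unambiguous sentence seen so far
    for tags in paragraph_tags:
        if len(tags) == 1:
            previous = next(iter(tags))
            resolved.append(set(tags))
        elif previous is not None and (not tags or previous in tags):
            # Contiguous sentences are assumed to share the same language.
            resolved.append({previous})
        else:
            resolved.append(set(tags))
    return resolved
```

This sketch only propagates a decision forwards; an ambiguous sentence
at the start of a paragraph keeps its ambiguity.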

An interesting tool to build is a general document
structure recognizer. Theoretical work in this field
is in progress (Lucas et al., 1993; Lucas, 1992) but,
as far as we know, no implementation has been done
yet.

4 Implementation 

The implementation of this research can be divided
into two parts: sentence tokenization and language
classification.

4.1 Sentence tokenization 

Sentence tokenization is a problem in itself because
documents may come from different electronic
sources. Also, a sentence does not always start with a
capital letter and finish with a full stop
(especially in emails). Texts are not formatted and
miscellaneous characters can be found everywhere.

Acronyms, abbreviations, full names and numbers
make the problem worse by inserting periods and/or
spaces everywhere without following any rule. But
no rule can ever exist in free-style texts.

We wrote a robust sentence parser which solves
the majority of these cases, allowing us to categorize
multilingual sentences in good conditions.

4.2 Language classification 

The implementation simply follows the previous
ideas.

To manage the possible points of change of
language via included segments (see section 1.2), the
language classification procedure uses a recursive
algorithm to easily handle changes of context.

The classification principle is the following: 

• For each word of the sentence:

  - Check whether the word belongs to the
    grammatical words list of some languages.

  - If so, increment their likelihood of being
    selected.

  - Check whether the word morphology suggests
    that it belongs to some languages.

  - If so, increment their likelihood of being
    selected.

• Tag the sentence with the names of the
languages which share the highest likelihood.

This algorithm has linear time complexity.

Language    Grammatical Words
French      301
English     186
Spanish     204
German      158

Table 1: Number of Grammatical Words

Language of Corpus    Number of Sentences
French                4502
English               6735
Spanish               94
German                393

Table 2: Size of Corpus
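
The classification principle of this section can be sketched in Python
as follows. This is a minimal illustration under stated assumptions,
not the implementation evaluated here: the word lists are tiny samples
(the real lists hold 158 to 301 entries), the morphology check is
reduced to a few language-specific letters, and the names
`GRAMMATICAL_WORDS`, `DISTINCTIVE_LETTERS` and `classify` are
hypothetical.

```python
from collections import Counter

# Assumption: tiny illustrative samples of grammatical words.
GRAMMATICAL_WORDS = {
    "French": {"le", "la", "les", "de", "et", "est", "un", "une", "dans", "que"},
    "English": {"the", "of", "and", "is", "a", "in", "that", "to", "on"},
    "Spanish": {"el", "la", "los", "de", "y", "es", "un", "una", "en", "que"},
    "German": {"der", "die", "das", "und", "ist", "ein", "eine", "in", "zu"},
}

# Assumption: letters used as a stand-in for the morphology/alphabet check.
DISTINCTIVE_LETTERS = {
    "French": set("àâçèêëîïôùûœ"),
    "English": set(),
    "Spanish": set("áíóúñ"),
    "German": set("äöß"),
}


def classify(sentence):
    """Return the set of languages sharing the highest likelihood."""
    scores = Counter()
    for token in sentence.lower().split():
        word = token.strip(".,;:!?\"'()")
        for lang in GRAMMATICAL_WORDS:
            if word in GRAMMATICAL_WORDS[lang]:
                scores[lang] += 1  # grammatical word clue
            if set(word) & DISTINCTIVE_LETTERS[lang]:
                scores[lang] += 1  # alphabet clue
    if not scores:
        return set(GRAMMATICAL_WORDS)  # no clue: total indetermination
    best = max(scores.values())
    return {lang for lang, score in scores.items() if score == best}
```

Each word costs a constant number of set lookups for a fixed number of
languages, so the sketch stays linear in the number of words. Returning
a set rather than a single name keeps unresolved ambiguities visible,
as for "la casa", where the sample lists cannot separate French from
Spanish.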

5 Evaluation 

5.1 The Test-Bed 

The test-bed has been prepared to process
French, English, Spanish and German. We used
dictionaries to get the grammatical words of each
language (see Table 1) and their alphabets.

We decided to use different kinds of documents to
test robustness, speed, precision and tolerance of
textual errors. So, we collected scientific texts,
emails and novels (see Table 2).

5.2 Results 

The results we obtained were as expected. They
express the fact that a sentence is usually written
with grammatical words and that grammatical words are
totally discriminant for sentences of more than 8
words.

From 1 to 3 words, the language is mostly left
totally undetermined. In fact, the corpus shows that
we are then processing included segments (via quotes
and parentheses), where there are no grammatical words
and few clues to rely on. Deductions really start
between 4 and 6 words. Here, sentences and grammatical
words appear, but in too small quantities to allow a
perfect deduction.

These results show that alphabets are not good
enough to discriminate short sentences. The methods
described in section 3 must be implemented to improve
results in this case.



Language of Corpus    Min Length    Decisive Word    Max Length
French                1             8                125
English               1             7                76
Spanish               1             4                42
German                1             5                66

Table 3: Isolation of a single language



In Table 3, with the French corpus, the program
always succeeds in isolating a single language for all
the sentences containing from 8 to 125 words. For
less than 8 words there are still ambiguities or total
indetermination.
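
One possible reading of the decisive length is sketched below in
Python. The function name and the input format (one pair of word count
and candidate languages per sentence) are assumptions for this
illustration; the figures in Table 3 were not produced by this code.

```python
def decisive_length(results):
    """Smallest word count from which every sentence gets a single language.

    results: list of (word_count, candidate_languages) pairs.
    Returns None if the corpus is empty or if its longest sentences are
    still ambiguous.
    """
    if not results:
        return None
    longest = max(count for count, _ in results)
    # Word counts of sentences left ambiguous or totally undetermined.
    undecided = [count for count, langs in results if len(langs) != 1]
    threshold = max(undecided, default=0) + 1
    return threshold if threshold <= longest else None
```

For a corpus whose longest undecided sentence has 7 words and whose
longest sentence has 125 words, this reading gives a decisive length
of 8.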

5.3 Errors 

Isolating a single language does not necessarily mean
isolating the right language. The error rate is about
0.01% and concerns very short sentences ("e mail",
where "e" is analysed as Spanish), a change of
language without quotes inside a sentence, or an
unexpected language (the Latin "Orbi et Urbi").

6 Conclusion 

This classification method is based on the
observation of texts and the understanding of their
natural properties. It does not depend on training
sets and converges fast enough to achieve very good
results on sentences.

This tool now serves as a switch for Jacques Vergne's
multilingual syntactic parser (for French, English
and Spanish).

The aim of this paper is also to point out that the
more the linguistic properties of the object are used,
the better the results are.